Capriccio: Scalable Threads for Internet Services

Rob von Behren, Jeremy Condit, Feng Zhou, George C. Necula, and Eric Brewer

U.California, Berkeley

SOSP 2003

Presented by: Jiayang Li / 李伽扬

Main Contribution

  • Scalable user-level thread package
    • Alternative to
      • Event-based models
      • Kernel-thread models
  • Scalability to 100,000 threads
  • Efficient for Internet Servers

Key Features

  • Scalability with user-level threads
    • Cooperative scheduling
    • Asynchronous disk I/O
    • Efficient thread operations - O(1)
  • Linked stack management
  • Resource-aware scheduling

Outline

  • Related Work and “Debate”
  • Capriccio Scalability with user-level threads
  • Linked Stack Management
  • Resource-Aware Scheduling

Debate – event-based side

  • Event-based arguments by Ousterhout (Why threads are bad?, 1996)
    • Events are more efficient (no context switching and locking overheads with threads)
    • Threads - hard to program (deadlocks, synchronization)
    • Poor thread support (portability, debugging)
  • Many event-based implementation (Harvest, Flash, SEDA)

Debate – other arguments

  • Neutral argument by Lauer and Needham (On the duality of OS system structures, 1978)
    • Any system constructed according to a model can have a direct counterpart in the other model
  • Pro-thread arguments by Behren, Condit, Brewer (Why events are bad?, 2003)
    • Greater code readability
    • No “stack-ripping”
    • Slow thread performance - implementation artifact
    • High performance servers more sensitive to scheduling

Capriccio

  • Philosophy
    • Thread model is useful
    • Improve implementation to remove barriers to scalability
  • Techniques
    • User-level threads
    • Linked stack management
    • Resource aware scheduling
  • Tools
    • Compiler-analysis
    • Run-time monitoring

Why user-level threads?

  • Decoupling from the OS/kernel
    • OS independence
    • Kernel variation
    • Address application-specific needs 1
  • Cooperative threading – more efficient synchronization
  • Less “kernel crossing”
  • More efficient memory management

Implementation

  • Non-blocking wrappers for blocking I/O
  • Asynchronous disk I/O where possible 1
  • Scheduling loop resembling an event-driven application
  • Cheap synchronization
  • Efficient O(1) thread operations

Benchmarks

  • (left) Capriccio scales to 100,000 threads
  • (right) Network I/O throughput with Capriccio only has 10% overhead over epoll

Benchmarks

  • With asynchronous I/O disk performance is comparable in Capriccio vs. other thread packages

Disadvantages of user-level threads

  • Non-blocking wrappers of blocking I/O increase kernel crossings
  • Extra overhead for function calls
  • Difficult to integrate with multiple processor scheduling

Linked Stacks

  • Problem: Conservative stack allocations per thread are unsuitable for programs with many threads.
  • Solution: Dynamic stack allocation with linked chunks alleviates VM pressure and improves paging behavior.
  • Method: Compile-time analysis and checkpoint injection into the code.

Linked Stacks: Algorithm

  • Each node is a call site annotated with the maximum stack space for that call.
  • Checkpoints must be inserted at each recursive frame and well-spaced call sites.
  • Checkpoints determine whether to allocate a new stack chunk.

Challenging cases

  • Function pointers are only determined at runtime
  • It is more difficult to bound the stack space used by precompiled libraries.

Apache 2.0.44 Benchmark

  • Given 2 KB “max-path” only 10.6% call sites required check-pointing code.
  • Overhead in the number of instructions was 3-4%.

Scheduling: Blocking Graph

  • Lessons from event systems
    • Break app into stages
    • Schedule based on stage priorities
  • Capriccio does this for threads
    • Deduce stage with stack traces at blocking points
    • Prioritize based on runtime information

Resource-Aware Scheduling

  • Track resources used along BG edges
    • Memory, file descriptors, CPU
    • Predict future from the past
  • Algorithm
    • Increase use when underutilized
    • Decrease use near saturation
  • Advantages
    • Operate near the knee w/o thrashing
    • Automatic admission control

Pitfalls

  • Maximum capacity of a particular resource is difficult to determine
  • Thrashing is not easily detectable.
  • Non-yielding threads lead to unfairness and starvation in cooperative scheduling.
  • Blocking graphs are expensive to maintain (for Apache 2.0.44 stack trace overhead is 8% of execution time).

Web Server Performance

  • Apache 2.0.44 on a 4x500 MHz Pentium server has 15% higher throughput with Capriccio.

Conclusion

  • Capriccio demonstrates a user-level thread package that achieves
    • High scalability
    • Efficient stack management
    • Scheduling based on resource usage
  • Drawbacks
    • High overhead in stack tracing
    • Lack of sufficient multi-processor support

Future Work

  • Extending Capriccio to work with multiple processors
  • Reducing the kernel crossings with batching asynchronous network I/O
  • Disambiguate function pointers in stack allocation
  • Improving resource-aware scheduling
    • Tracking variance in resource usage
    • Better detection of thrashing

Thank You!

Q & A